sanity checks

The glossary is being gradually proof checked, but currently has many typos and misspellings.

Raw data may have errors, mistypings or other problems. Sanity checks are small, usually relatively simple, pieces of code run across a dataset to verify simple properties; for example whether items indexed in a book all appear in the glossary, or whether a field that is intended to be a number indeed only contains numbers. This is often a critical part of data cleaning (or data wrangling) In addition, when dealing with large quantities of data it can be hard to be sure that transformations, which appear sensible in code, actually do what was intended as only a small portion of the data can be inspected by hand. Sanity checks can again be applied after a transformation stage to help build confidence. Sanity checks often involve simple regular expression matching (e.g. to verify the number field) or looking up values (e.g. to check glossary items are defined).

Used in Chap. 10: pages 132, 142